I want to share a take on grammar‑constrained generation I’ve been working on for a while. The punchline is:
You can compute “which LLM tokens are valid next?” by reading the parser stack through a precomputed weighted automaton—instead of simulating the parser for every candidate token.
At runtime, mask generation becomes a tight loop of table lookups and bitset AND/OR operations, not "run the lexer+parser on a huge trie of token bytes every step."
This post focuses on the runtime idea and the core compilation decomposition (token→terminal mapping, per‑terminal stack behavior, composition). The gnarly “how to actually build these automata without blowing up determinization/memory” details are real, and I’ll likely write those up separately.
You have a grammar (often “JSON Schema”, but let’s just say some CFG-ish constraint) and you want the LLM to produce syntactically valid output.
At each decoding step, the model produces logits over a fixed vocabulary (often ~100k–200k tokens). You sample one token, append it, repeat.
Constrained decoding means: don’t sample tokens that would make the output impossible to complete into a valid string.
Formally, if `p` is the already‑generated token prefix, token `t` is valid if `p·t` can still be extended to a full grammar string:

`∃s : p·t·s ∈ L(G)`

So the mask we want is:

`mask(p) = { t ∈ V : ∃s, p·t·s ∈ L(G) }`
This “exists a completion” definition is the right one for decoding: you don’t just want “does the prefix parse so far?”; you want “did I dead‑end the grammar?”.
Think of the constraint system as two functions:
- `get_mask(state) -> bitset(V)`
- `commit(state, token) -> state`

At runtime, decoding looks like:
repeat:
(GPU) compute logits
(CPU) compute mask
apply mask to logits
sample token
commit token to both:
- the model (KV cache / etc.)
- the constraint state
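The two‑function interface above can be exercised end to end with a toy constraint standing in for a real grammar. Everything here (the class, the 4‑token vocabulary, the "digits then one dot" rule) is invented for illustration:

```python
# Toy constraint: "only digits, and at most one '.'", over a 4-token vocab.
# Masks are plain Python ints used as bitsets (bit i => token i allowed).
V = ["0", "1", ".", "a"]  # illustrative vocabulary

class DigitsDotConstraint:
    def __init__(self):
        self.seen_dot = False

    def get_mask(self):
        mask = 0
        for i, tok in enumerate(V):
            if tok.isdigit() or (tok == "." and not self.seen_dot):
                mask |= 1 << i
        return mask

    def commit(self, i):
        if V[i] == ".":
            self.seen_dot = True

state = DigitsDotConstraint()
out = []
for step in [1, 0, 2, 1]:  # pretend these are the sampled token ids
    mask = state.get_mask()
    assert mask & (1 << step), "sampled token must be masked-in"
    state.commit(step)
    out.append(V[step])
print("".join(out))  # prints "10.1"
```

The real system replaces `get_mask` with the automaton walk described below and `commit` with an incremental lexer+parser update, but the call pattern in the decode loop is exactly this.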
You can overlap GPU and CPU work, but the mask is still on the critical path: if it’s late, the GPU waits.
Inference is batched. A single “slow” request can introduce bubbles or stall a batch scheduler.
So the question isn’t:
“Can we compute masks fast on average?”
It’s closer to:
“Can we compute masks under a fixed budget every single step?”
If you’re running high‑throughput decoding, your per‑token time budget can easily be in the sub‑millisecond to low‑millisecond range per stream. Even if your GPU forward pass is longer than 1ms, mask generation still needs to be stable: schedulers and batching strategies hate unpredictable tail latency.
Mask generation is also frequently branchy and cache‑unfriendly in naïve designs—exactly the kind of workload that looks fine at p50 and awful at p99.9.
The commit(token) side is “incremental parsing”: you feed the newly generated bytes to a lexer+parser and update the parse configuration.
The parsing model I’m assuming here is a GLR parser driven by LR parse tables, with a graph‑structured stack (GSS).
The important pieces:
- the ACTION table, indexed as `ACTION[top_state, lookahead_terminal]`, which decides shift vs. reduce;
- the GOTO table, which gives the state pushed after a reduction.

So the constraint state is something like: the current lexer state plus the GSS of LR states.
If you’ve seen tree‑sitter, Lezer, or classic GLR literature, this is that world.
The lexer turns raw bytes into grammar terminals (`PLUS`, `IDENT`, `STRING`, etc.). Given the current constraint state, we want a mask over LLM tokens.
A naïve statement is:
“A token is allowed if appending it keeps the parse valid.”
But the parser doesn’t consume LLM tokens. It consumes grammar terminals.
And the mapping is messy:
(e.g., the LLM token `"true,"` might lex to `TRUE` then `COMMA`). So “check token `t`” actually means: run `t`’s bytes through the lexer from the current lexer state, enumerate the terminal sequences they could produce, and simulate the parser on each.
Doing that for a single token is manageable.
Doing that for 200k tokens every step is not.
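To make the token/terminal mismatch concrete, here is a toy maximal‑munch lexer splitting one LLM token’s bytes into several grammar terminals. The terminal set and patterns are invented for illustration:

```python
import re

# Hypothetical terminal set; longest-match is implicit in the patterns here.
TERMINALS = [("TRUE", r"true"), ("COMMA", r","), ("NUMBER", r"\d+")]

def lex(s):
    """Greedily split a byte string into grammar terminals."""
    out, i = [], 0
    while i < len(s):
        for name, pat in TERMINALS:
            m = re.match(pat, s[i:])
            if m:
                out.append(name)
                i += m.end()
                break
        else:
            raise ValueError(f"no terminal matches at {s[i:]!r}")
    return out

print(lex("true,"))    # prints ['TRUE', 'COMMA']
print(lex("true,12"))  # prints ['TRUE', 'COMMA', 'NUMBER']
```

One vocabulary entry, two (or three) terminals: this is why “validate token `t`” is not a single table lookup in the naïve formulation.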
A common direction is: build a trie over all vocabulary token byte sequences, then each step walk the trie once, running the lexer+parser along its edges and collecting which leaves survive.
This avoids a literal “for each token” loop and shares prefixes.
It can work decently for many grammars.
But worst‑case behavior is still brutal, for a few reasons:
Modern vocabularies contain weird long tokens: repeated punctuation, long runs of spaces, long runs of digits, etc. (If you’ve inspected cl100k_base/o200k_base, you’ve seen these.)
If a token is, say, 80–120 bytes long and those bytes are syntactically valid in the current context, then a trie traversal must still walk deep to confirm it.
This is already annoying even for deterministic parsing.
In LR parsing, a single terminal can trigger a chain of reductions before it is finally shifted.
Those reductions pop stack frames and consult GOTOs, and in GLR they can create fan‑out.
Even if average behavior is fine, the tail can be dominated by long reduction chains, GSS fan‑out, and deep trie walks for long tokens.
If your mask algorithm is “simulate parsing along many trie edges”, you’re multiplying this work.
In p99.9 land you want inner loops that look like: table lookups, sequential memory access, a fixed handful of bitwise ops.
Trie traversal + GLR is the opposite: pointer chasing, branching, irregular memory access.
I tried hard to engineer this style of approach into something stable. In my experience it’s a dead end if you care about strict, predictable latency.
Token validity depends on the current parse configuration—especially what’s on the LR stack.
So instead of asking:
“For each token `t`, does it work on this stack?”
ask:
“Given this stack, which tokens work?”
That suggests a precomputed function, roughly `mask(stack) -> bitset(V)`, which we can evaluate fast at runtime.
The key enabling fact from LR theory is:
The set of viable prefixes (strings that can appear on the parser stack during a valid parse) forms a regular language (Knuth).
Informally: although the grammar language is context‑free, the space of possible stacks is regular, recognized by a finite automaton (the LR(0) automaton).
So there exists a finite automaton that can read (a suffix of) the stack and decide things about future validity.
We’ll use that to compute a mask, not just accept/reject.
Here’s the core data structure: a finite automaton whose input symbols are LR stack states and whose transition weights are bitsets over the LLM vocabulary.
Think “finite automaton with bitset weights.”
We use bitsets with:
- AND (intersection) along a path, and
- OR (union) where alternative paths merge.

This fits the intuition: reading more of the stack can only rule tokens out (AND), while alternative parse configurations can only rule tokens back in (OR).
At mask time, we start at the top of the stack, step the automaton over stack symbols while AND‑ing edge weights, and OR the results across alternatives.
This is the “read‑the‑stack” idea.
If the automaton is deterministic, the conceptual runtime is almost embarrassingly simple:
def get_mask_single_stack(dwa, lexer_state, stack_top_to_bottom):
q = dwa.start_state_for_lexer(lexer_state)
mask = ALL_TOKENS_BITSET
for sid in stack_top_to_bottom:
q, w = dwa.step(q, sid) # table lookup
mask &= w # bitset AND
if dwa.can_stop_here(q, mask):
break
mask &= dwa.final_weight(q)
return mask
In practice you may still track a small frontier (e.g., multiple lexer states, or running over a GSS), but the inner operations stay “lookup + AND/OR”.
With GLR you don’t have one stack; you have a DAG of stacks (the GSS). Each path from a top node down to the root corresponds to one possible LR stack.
For get_mask, you want:
a token is valid if it is valid on any surviving parse stack.
So we run the same propagation, but over a graph:
The shape is the same: AND along edges, OR at merges.
You can implement it as a dynamic program / worklist over (gss_node, automaton_state) pairs. The key is: it’s still just table lookups and bitset ops; no “simulate parser transitions for every vocab token”.
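A minimal sketch of that worklist, under heavy simplifying assumptions: the GSS nodes are plain tuples, the automaton is collapsed to a single state, and the per‑stack‑symbol weights are made‑up bitsets rather than anything compiled from a grammar:

```python
from collections import deque

# Toy GSS: node = (lr_state_id, parents). Two stacks sharing a suffix:
#   path A: 3 -> 1 -> 0,   path B: 4 -> 1 -> 0
root = (0, [])
mid  = (1, [root])
topA = (3, [mid])
topB = (4, [mid])

# Made-up per-symbol weights over a 4-token vocabulary, plus a final weight.
WEIGHT = {0: 0b1111, 1: 0b0111, 3: 0b0011, 4: 0b0101}
FINAL = 0b1111

def get_mask_gss(tops):
    result = 0
    work = deque((t, 0b1111) for t in tops)  # (node, accumulated AND)
    while work:
        (sid, parents), acc = work.popleft()
        acc &= WEIGHT[sid]            # AND the edge weight along this path
        if parents:
            for p in parents:         # fan out down the shared stack
                work.append((p, acc))
        else:
            result |= acc & FINAL     # OR complete paths together
    return result

print(bin(get_mask_gss([topA, topB])))  # prints 0b111
```

Path A contributes `0b0011`, path B contributes `0b0101`, and the union is `0b0111`: a token valid on *any* surviving stack stays in the mask, exactly the GLR semantics described above.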
Two different (but compatible) reasons we can stop early: the undecided part of the mask may become empty (every token’s fate is already settled), or the automaton may reach a state from which deeper stack symbols cannot change the weights.
A useful way to engineer this is to make the automaton explicitly track “already accepted tokens” versus “still undecided tokens,” so you can stop when the undecided set becomes empty. Conceptually:
This is one of those optimizations that’s obvious in hindsight: you want to avoid doing work on bits whose fate is already fixed.
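A toy version of the accepted/undecided split, with made‑up bitsets in place of compiled weights. For each stack symbol we assume two bitsets: `fin`, tokens whose validity is settled at this depth, and `w`, tokens that remain plausible if we keep reading deeper:

```python
def mask_with_early_stop(stack_weights):
    """Walk stack symbols top-down; stop once no token's fate is open."""
    accepted = 0
    undecided = 0b1111  # 4-token toy vocabulary, all undecided initially
    steps = 0
    for w, fin in stack_weights:
        steps += 1
        accepted |= undecided & fin   # fate fixed here: accept these now
        undecided &= w & ~fin         # these still depend on deeper symbols
        if undecided == 0:
            break                     # deeper symbols can't change anything
    return accepted, steps

mask, steps = mask_with_early_stop([
    (0b1110, 0b0010),  # symbol 1: token 1 settled, token 0 ruled out
    (0b1100, 0b1100),  # symbol 2: the rest settled
    (0b1111, 0b0000),  # never reached
])
print(bin(mask), steps)  # prints "0b1110 2"
```

The payoff is the `break`: with well‑shaped weights, the loop usually terminates after a short, bounded stack suffix rather than walking to the root.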
So far I’ve described the runtime object as if it just exists.
The compilation story has three pieces: the token→terminal mapping (the Terminal DWA), per‑terminal stack behavior, and the composition of the two into the Parser DWA.
I’ll sketch each. This is the minimum detail needed to make the idea “real” without diving into all the compile-time engineering.
A grammar is defined over terminals. A language model produces LLM tokens.
These symbol sets don’t line up.
So we precompute, for each lexer/tokenizer state, which vocabulary tokens can produce which terminal sequences.
The lexer states here have to track in‑progress maximal munch (“I’ve seen `+` but it could become `++` if I see another `+`”). Build a trie over all vocabulary token byte sequences. Then simulate the grammar lexer over the trie.
Whenever the lexer recognizes a terminal on some trie path, record: the starting lexer state, the terminal emitted, the resulting lexer state, and the set of vocabulary tokens whose byte paths pass through that point.
This gives you an automaton whose states are lexer configurations, whose edges are labeled with terminals, and whose weights are vocab‑token bitsets.
Then determinize it (subset construction) to merge equivalent lexer configurations.
The result is: given a starting lexer state, you can walk through “what terminals could this token emit?” while carrying along exactly which vocab tokens correspond to those terminals.
You can think of this as a reversed transducer: instead of mapping token → terminals, it maps terminal-sequence paths → sets of tokens that realize them.
Now assume a grammar terminal has been recognized.
What happens to the LR parser stack?
In LR parsing, on lookahead terminal `t`, the parser consults the ACTION table at the current top‑of‑stack state: shift `t`, reduce by some production (and consult again), accept, or reject.
So before you can shift `t`, you may need a chain of reductions. That chain depends on deeper stack elements because reductions pop.
Even though the LR stack can be arbitrarily deep, the behavior “what reductions happen before shifting `t`?” depends only on some suffix of the stack.
You can capture all possible reduction chains for terminal `t` as a finite graph that: follows reduce actions, pops the frames each reduction consumes, pushes the corresponding GOTO states, and terminates when `t` can finally be shifted.
This is basically a finite automaton with a “stack effect” register.
It’s convenient to represent stack operations as a word over symbols: `pop(q)` and `push(q)`, one pair of symbols per LR state `q`.
Concatenating two terminal effects concatenates these words.
Then you apply the obvious cancellation rule: `push(q)` immediately followed by `pop(q)` cancels to nothing, while `push(q)` followed by `pop(r)` with `r ≠ q` marks the path impossible.
Algebraically this is the polycyclic monoid; practically it’s “well-typed push/pop cancellation.”
Why do we care? Because LLM tokens often correspond to multiple terminals; composing per-terminal stack effects and canceling adjacent push/pop pairs massively simplifies what you need to inspect on the original stack.
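The cancellation step is mechanical enough to sketch directly. This toy `cancel` treats a stack effect as a list of `"pop:q"` / `"push:q"` strings read left to right, and returns `None` for the ill‑typed (zero) case; the encoding is an illustration, not the real representation:

```python
def cancel(word):
    """Cancel adjacent push/pop pairs in a stack-effect word."""
    out = []
    for sym in word:
        if out and out[-1].startswith("push:") and sym.startswith("pop:"):
            if out[-1][5:] == sym[4:]:
                out.pop()       # push q ; pop q  =>  identity
                continue
            return None         # push q ; pop r  =>  zero (impossible path)
        out.append(sym)
    return out

# Composing effect "pop 2, push 5" with "pop 5, push 9" nets "pop 2, push 9":
print(cancel(["pop:2", "push:5", "pop:5", "push:9"]))  # prints ['pop:2', 'push:9']
print(cancel(["push:5", "pop:6"]))                     # prints None
```

The first example is the point: after composing two terminals’ effects, the intermediate `push 5 / pop 5` disappears, so the composed effect inspects one fewer symbol of the original stack.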
For each terminal `t`, we build a small NFA‑like object whose accepting paths correspond to: the possible reduction chains that end in shifting `t`, each annotated with its net stack effect.
This automaton is derived directly from the LR parse table: edges correspond to “this state shifts `t`” or “this state reduces by production `p`” and so on.
Determinize it so that later composition is manageable.
Now we have: a Terminal DWA mapping LLM tokens to terminal sequences (with vocab bitsets as weights), and a per‑terminal automaton describing each terminal’s stack effect.
To get “stack → vocab mask”: substitute each terminal edge of the Terminal DWA with that terminal’s stack‑effect automaton, cancel adjacent push/pop pairs, and determinize.
This final structure is what I call the Parser DWA.
For get_mask, we only care whether a token is valid, not exactly what new stack would result.
The actual stack update is handled by commit(token) using the real lexer+parser.
So during compilation, after composing and canceling, any remaining “push” symbols that represent “the parser would push state X” can be discarded from the mask automaton. The runtime commit will do the true stack manipulation.
This is also what lets you do useful things like: start from an arbitrary prompt prefix (maybe partially through a JSON value) and commit raw bytes without needing to align to LLM token boundaries.
There are two complementary views here: bounded reduction chains, and regularity of viable prefixes.
A parser can, in principle, take arbitrarily long chains of reductions without consuming input if the grammar has certain recursive patterns (right recursion / hidden left recursion). There’s classic work (Aycock & Horspool) characterizing when reduction chains can be unbounded and how to transform grammars to avoid it.
After the usual normalizations, the number of reductions you need to consider before a shift becomes bounded by a constant (for that grammar). That implies: deciding token validity only requires looking at a bounded suffix of the stack.
Classical LR theory says viable prefixes form a regular language and are recognized by a finite automaton (the LR(0) automaton). Our Parser DWA is essentially a weighted extension of that automaton: instead of merely accepting/rejecting a stack configuration, it computes “which vocab tokens extend it.”
Regularity is the deep reason “finite automaton reading stack suffix” is even possible.
There’s one more annoying real-world detail: the grammar lexer is streaming, and “token boundaries” don’t align with LLM tokens.
Even with longest-match lexing (“maximal munch”), you can’t always commit a terminal the moment it matches, because it might still be extendable.
Example:
- The lexer has seen `"+"` so far. That matches `PLUS`.
- But if the next byte is another `"+"`, then the correct token is `INCREMENT` (`"++"`), not `PLUS` then `PLUS`.

In a traditional lexer, you solve this with lookahead on the raw input stream. In incremental decoding, the raw input arrives token‑by‑token, so you need a streaming strategy.
When a terminal matches at some point but is still extendable, fork the lexer/parser state: one branch commits the terminal now; the other keeps consuming bytes toward a longer match.
If later bytes allow a longer match, prune the premature branch.
This creates multiple active lexer/parser states—exactly what GLR + GSS machinery is good at representing compactly.
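Here is the `PLUS`/`INCREMENT` situation as runnable code, in a deliberately simplified form: because one byte of lookahead resolves this particular fork, the toy lexer can resolve it eagerly instead of keeping both branches alive the way the real GSS machinery does. The grammar (digits, `+`, `++`) is invented:

```python
def lex_stream(text):
    """Stream bytes one at a time, committing PLUS only when it can't extend."""
    terminals, pending = [], ""
    for ch in text:
        if pending == "+" and ch == "+":
            terminals.append("INCREMENT")  # longer match wins over PLUS PLUS
            pending = ""
        elif ch == "+":
            if pending:                    # flush pending digits as NUMBER
                terminals.append("NUMBER")
            pending = "+"                  # matches PLUS, but still extendable
        else:                              # a digit
            if pending == "+":
                terminals.append("PLUS")   # "+" could not extend: commit it
                pending = ""
            pending += ch
    if pending == "+":                     # end of input settles the fork
        terminals.append("PLUS")
    elif pending:
        terminals.append("NUMBER")
    return terminals

print(lex_stream("1++2"))  # prints ['NUMBER', 'INCREMENT', 'NUMBER']
print(lex_stream("1+2"))   # prints ['NUMBER', 'PLUS', 'NUMBER']
```

With real grammars the ambiguity can persist across many bytes, which is why the actual system forks configurations into the GSS and prunes later rather than relying on single‑byte lookahead.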
If you have multiple active parse configurations (due to lexer ambiguity, inhibited terminals, or LR conflicts), then:
an LLM token is allowed if it is allowed in any active configuration.
So get_mask unions the bitsets across branches.
That “union across alternatives” is exactly why the weighted automaton story uses union as the merge operator.
At runtime you do two very different kinds of work: a cheap `get_mask` and an honest `commit`.
`commit(token)` (parser work). This is “real parsing.” It can be complicated and it inherits the usual GLR theoretical caveats.
`get_mask(state)` (automaton work). This is table lookups plus bitset AND/ORs over a bounded stack suffix. That’s the crucial separation: mask generation never simulates reduction chains at runtime; that work has been compiled away into the automaton.
I’m not going to claim a formal optimality theorem here; parsing theory is too spiky for that.
But the division of labor feels “asymptotically right”: `commit` pays the real parsing cost once per sampled token, while `get_mask` does a bounded number of table lookups and bitset operations per step.
Also: you can’t avoid touching bits if the output is literally a |V|-bit mask. The goal is to make that bit touching occur in a small, fixed number of AND/ORs, not inside a giant vocabulary loop with parsing simulation.
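A quick back‑of‑envelope for the “unavoidable bits”: stored as 64‑bit words, a mask for a 200k‑token vocabulary (size chosen for illustration) is a few thousand words, so a full‑mask AND is one short, branch‑free loop:

```python
V = 200_000                 # illustrative vocabulary size
WORDS = (V + 63) // 64      # 64-bit words needed to hold one mask

def and_masks(a, b):
    """Intersect two masks word by word; no branches, no pointer chasing."""
    return [x & y for x, y in zip(a, b)]

full = [(1 << 64) - 1] * WORDS
none = [0] * WORDS
print(WORDS)  # prints 3125
```

A handful of these per decoding step is cheap and, more importantly, has the same cost every step.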
Everything above punts on the hardest engineering fact:
building these automata can blow up if you’re careless.
You’re doing determinization and minimization in a weighted / semiring-ish setting, and real grammars are large.
Most of my time went into making compilation time and memory sane.
That’s a whole separate writeup.
If you only remember one idea from this post, it’s this: you can answer “which LLM tokens are valid next?” by reading the parser stack through a precomputed weighted automaton, with nothing but table lookups and bitset AND/ORs at runtime.
The result is a get_mask whose runtime behavior is predictable enough to care about in p99.9 land, while commit remains a standard incremental GLR parser update.
If/when I write the follow-up, it’ll be about the compile-time tricks: building the Terminal DWA, building the per-terminal templates, composing/canceling deterministically, and keeping the result small.